Description of problem:
Shutting down the I/O-serving node takes 15-20 minutes for I/O to resume from the failed-over node.

Version-Release number of selected component (if applicable):
ganesha-2.3.1-4

How reproducible:
Always

Steps to Reproduce:
1. Create a 4-node cluster and configure ganesha on it.
2. Create a distributed-replicated 6x2 volume and mount it with vers=3 and vers=4 on two clients respectively.
3. Start creating I/O (100 KB files in my case) from both mount points.
4. Shut down the node that is serving the I/O.

Performed the above scenario 3 times; observations are as below:

1st attempt:
- With vers=4, I/O stopped and resumed after ~17 minutes.
- With vers=3, I/O continued without interruption.

2nd attempt:
- With vers=4, I/O stopped during the grace period and resumed after it.
- With vers=3, I/O stopped and resumed after ~15 minutes.

3rd attempt:
- With vers=4, I/O stopped and resumed after ~20 minutes.
- With vers=3, I/O continued without interruption.

Actual results:
Shutting down the I/O-serving node takes 15-20 minutes for I/O to resume from the failed-over node.

Expected results:
I/O should resume as soon as the grace period finishes.

Additional info:
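For reference, the two client mounts in step 2 can be set up with commands like the following. The VIP address, volume name, and mount points here are placeholders, not the ones used in the actual test:

```shell
# On client 1: NFSv3 mount of the ganesha export (VIP/volume/paths are illustrative)
mount -t nfs -o vers=3 <VIP>:/testvol /mnt/nfs3

# On client 2: NFSv4 mount of the same export
mount -t nfs -o vers=4 <VIP>:/testvol /mnt/nfs4
```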
This is most likely the same as bug 1278336.
With glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64, while rebooting the I/O-serving node, it takes around ~9 minutes for I/O to resume from the failover node in the case of an NFSv4 mount.

1. Create a 4-node cluster and configure ganesha on it.
2. Create a distributed-replicated 6x2 volume and mount it with vers=3 and vers=4 on two clients respectively.
3. Start creating I/O (100 KB files in my case) from both mount points.
4. Shut down the node that is serving the I/O.

Performed the above scenario 3 times; observations are as below:

1st attempt:
- With vers=4, I/O stopped and resumed after ~9 minutes.
- With vers=3, I/O stopped for around ~1 minute and resumed within the grace period itself.

2nd attempt:
- With vers=4, I/O stopped and resumed after ~9 minutes.
- With vers=3, I/O stopped for around ~1 minute and resumed within the grace period itself.

3rd attempt:
- With vers=4, I/O stopped and resumed after ~8 minutes.
- With vers=3, I/O stopped for around ~1 minute and resumed within the grace period itself.

I tried swapping the clients for NFSv3 and NFSv4; the observation was the same (with NFSv4, it takes around ~9 minutes to resume on both clients).

Expected result: I/O should resume as soon as the grace period finishes.
Soumya,

1. Tried mounting the volume on a single client with NFSv4: I/O resumed after ~2 minutes.
2. Tried setting a timeout while mounting the volume on a single client with NFSv4:
   mount -t nfs -o vers=4,timeo=200 10.70.44.154:/ganeshaVol1 /mnt/ganesha1/
   I/O resumed after ~2 minutes.
Based on Comment 9, since it takes around ~9 minutes for I/O to resume from the failover node in the case of an NFSv4 mount, reopening this bug.
Thanks for retesting, Manisha. Frank/Dan/Matt, do you have any comments with respect to the updates in comment #11 and comment #12?
Without some form of logs from the failover time, I'm not sure I can say anything.
Hi Soumya, I have edited the doc text for the release notes. Can you please take a look at it and let me know if anything more is needed?
Hi Bhavana,

This bug was FAILED_QA as there was one outstanding issue. I have changed the doc_text to reflect that. Please check the same.

<<<<
When a volume is accessed by heterogeneous clients (i.e., both NFSv3 and NFSv4 clients), it was observed that NFSv4 clients take a longer time to recover after the virtual-IP failover caused by a node shutdown.

Workaround: Use different VIPs for the different access protocols (i.e., NFSv3 or NFSv4).
>>>>
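The workaround in the doc text above can be illustrated as follows. The two VIP addresses, the volume name, and the mount points are hypothetical; the point is only that NFSv3 clients and NFSv4 clients each mount through a VIP dedicated to that protocol:

```shell
# Hypothetical per-protocol VIPs configured in the ganesha-ha cluster.

# NFSv3 clients mount only through the v3-dedicated VIP:
mount -t nfs -o vers=3 10.70.44.201:/ganeshaVol1 /mnt/v3

# NFSv4 clients mount only through the v4-dedicated VIP:
mount -t nfs -o vers=4 10.70.44.202:/ganeshaVol1 /mnt/v4
```

With this split, a failover of the v4 VIP does not stall the v3 clients (and vice versa), which avoids the mixed-client recovery delay described above.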
Thanks Soumya. Added the doc text for the release notes.